Context: This script constitutes the initial data preprocessing phase within the 02_cleaning.ipynb notebook. The primary objective is to transform raw election data into a structured format suitable for the subsequent predictive modeling of U.S. elections.

In [1]:
import pandas as pd
import numpy as np

# ==========================================
# STEP 1: DATA LOADING
# ==========================================
# Import the raw dataset containing county-level presidential election results (2000-2024).
# sep=None asks pandas to detect the delimiter automatically, which requires the Python parsing engine.
file_path = '/Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/raw/election_results/countypres_2000-2024.csv'
df_raw = pd.read_csv(file_path, sep=None, engine='python')

print("✅ File successfully loaded.")
print(f"Detected columns: {df_raw.columns.tolist()}")

# Display the initial dimensions of the dataset to verify import integrity.
print(f"Initial dataset shape: {df_raw.shape}")

# ==========================================
# STEP 2: DATA CLEANING & FILTERING
# ==========================================
# Filter the dataset to retain only the two major political parties (Democrat and Republican).
# This exclusion of minor parties (e.g., Green, Libertarian) focuses the analysis on the primary electoral competition.
df_clean = df_raw[df_raw['party'].isin(['DEMOCRAT', 'REPUBLICAN'])].copy()
print("✅ Filtering complete.")

# Feature Engineering: Calculate 'vote_share' to normalize vote counts.
# This metric represents the proportion of total votes a candidate received within a specific county.
# Normalization allows for comparable analysis across counties with significant population disparities.
df_clean['vote_share'] = df_clean['candidatevotes'] / df_clean['totalvotes']

# ==========================================
# STEP 3: DATA RESHAPING (PIVOTING)
# ==========================================
# Transform the dataset structure from "Long Format" (observation per candidate) 
# to "Wide Format" (observation per county per election year).
# This restructuring aligns the data for machine learning, creating distinct features for Democratic and Republican performance.
df_pivot = df_clean.pivot_table(
    index=['year', 'state_po', 'county_name', 'county_fips'], # Unique identifiers for the observation unit
    columns='party',                                          # Categorical values transformed into column headers
    values='vote_share'                                       # Metric to populate the new fields
).reset_index()

# Flatten the hierarchical column structure resulting from the pivot operation for cleaner access.
df_pivot.columns.name = None

# Impute missing values with 0 to account for counties where a specific party received no recorded votes.
df_pivot = df_pivot.fillna(0)

from pathlib import Path

# ==========================================
# STEP 4: EXPORT & VALIDATION (Votes)
# ==========================================

# 1. Path configuration
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent
DATA_DIR = PROJECT_ROOT / 'data'
PROCESSED_DIR = DATA_DIR / 'processed'

# 2. Safety check: create the folder if it does not exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# 3. Build the full output path
output_filename = "votes_cleaned_2000_2024.csv"
save_path_votes = PROCESSED_DIR / output_filename

# 4. Export
# Export the processed dataset to a CSV file for use in subsequent modeling phases.
df_pivot.to_csv(save_path_votes, index=False)

print(f"Processing complete. Data saved to '{save_path_votes}'")
print("\nFirst 5 rows of the cleaned dataset:")
print(df_pivot.head())

FileNotFoundError: [Errno 2] No such file or directory: '/Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/raw/election_results/countypres_2000-2024.csv'
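
The FileNotFoundError above comes from an absolute path that only exists on one machine. A minimal sketch of a project-relative alternative is shown here. It assumes the repository keeps the `data/raw/election_results/` layout visible in the failing path; `find_project_root` and `raw_votes_path` are hypothetical helper names, not part of the original notebook.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "data") -> Path:
    """Return the closest directory at or above `start` that contains `marker`."""
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")


def raw_votes_path(project_root: Path) -> Path:
    """Build the raw election file path relative to the project root (assumed layout)."""
    return project_root / "data" / "raw" / "election_results" / "countypres_2000-2024.csv"
```

In the notebook this could be used as `pd.read_csv(raw_votes_path(find_project_root(Path.cwd())), sep=None, engine='python')`, so the cell no longer depends on a user-specific home directory.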

We clean and filter the dataset to keep only the two major political parties (Democrats and Republicans). We then compute each candidate's vote share (candidate votes divided by total votes in the county) to normalize the vote counts. Next, we reshape the dataset from one observation per candidate to one observation per county and election year. Missing values are imputed with 0 to account for counties where a party received no recorded votes. Finally, we export the clean dataset "votes_cleaned_2000_2024.csv" to the /data/processed folder.
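
The filter, vote-share and pivot steps can be illustrated on a toy table. This is only a sketch: the vote counts are made up, and the column names simply mirror those used in the cell above.

```python
import pandas as pd

# Toy long-format data: one row per county and party (made-up numbers).
toy = pd.DataFrame({
    "year": [2020] * 6,
    "county_fips": [1001] * 3 + [1003] * 3,
    "party": ["DEMOCRAT", "REPUBLICAN", "LIBERTARIAN"] * 2,
    "candidatevotes": [30, 60, 10, 45, 50, 5],
    "totalvotes": [100] * 6,
})

# Keep the two major parties and compute each candidate's share of the county total.
major = toy[toy["party"].isin(["DEMOCRAT", "REPUBLICAN"])].copy()
major["vote_share"] = major["candidatevotes"] / major["totalvotes"]

# Long -> wide: one row per (year, county), one column per party.
wide = major.pivot_table(
    index=["year", "county_fips"],
    columns="party",
    values="vote_share",
).reset_index()
wide.columns.name = None

print(wide)
```

Because the minor party is dropped before pivoting, the two share columns do not have to sum to 1 in each row.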

Context: This script processes the DP03_economic.csv file within the 02_cleaning.ipynb notebook. It focuses on preparing socio-economic predictors that will serve as independent variables in the predictive model.

In [19]:
import pandas as pd
import numpy as np

# ==========================================
# STEP 1: LOAD DATA
# ==========================================
# Define the file path for the raw socio-economic dataset (Economic Characteristics).
file_path_eco = "/Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/raw/socio-economic/DP03_economic.csv"

# Load the dataset. 'header=0' is specified to correctly identify the initial header row.
df_eco = pd.read_csv(file_path_eco, header=0)

# Remove the second row (index 0 after loading), which typically contains metadata descriptions 
# rather than actual observations in US Census datasets.
df_eco = df_eco.iloc[1:] 

# ==========================================
# STEP 2: VARIABLE SELECTION (DIMENSIONALITY REDUCTION)
# ==========================================
# Select a subset of relevant independent variables to mitigate the risk of overfitting.
# The selection focuses on key economic indicators: Unemployment, Income, Poverty, and Public Sector employment.
# 'DP03_0043PE' refers to the percentage of government workers.

columns_mapping = {
    'GEO_ID': 'fips',
    'NAME': 'county_name',
    'DP03_0009PE': 'unemployment_rate',      # Unemployment rate
    'DP03_0062E': 'median_income',           # Median household income
    'DP03_0128PE': 'poverty_rate',           # Poverty rate
    'DP03_0043PE': 'public_workers_pct'      # (BONUS) Percentage of public sector workers
}

# Create a new dataframe containing only the selected variables and rename them for clarity.
df_eco_clean = df_eco[columns_mapping.keys()].rename(columns=columns_mapping).copy()

# ==========================================
# STEP 3: DATA CLEANING & TYPE CONVERSION
# ==========================================
# Convert economic indicators from object/string types to numeric formats to facilitate statistical analysis.
# 'errors=coerce' converts non-numeric values to NaN (Not a Number).
cols_to_convert = ['unemployment_rate', 'median_income', 'poverty_rate', 'public_workers_pct']

for col in cols_to_convert:
    df_eco_clean[col] = pd.to_numeric(df_eco_clean[col], errors='coerce')

# Standardize the Federal Information Processing Standard (FIPS) codes.
# This extracts the last 5 digits to ensure consistency with other datasets (removing the '0500000US' prefix).
df_eco_clean['fips'] = df_eco_clean['fips'].astype(str).str[-5:]

from pathlib import Path

# ==========================================
# STEP 4: EXPORT PROCESSED DATA (Sentiment Analysis)
# ==========================================

# 1. Path configuration
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent
DATA_DIR = PROJECT_ROOT / 'data'
PROCESSED_DIR = DATA_DIR / 'processed'

# 2. Safety check: create the folder if it does not exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# 3. Build the full output path
output_name = 'political_sentiment_DETAILED.csv'
save_path_sentiment = PROCESSED_DIR / output_name

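# 4. Export
# NOTE: this line writes df_pivot (the votes table from the first cell), not df_eco_clean,
# and the file name refers to sentiment data; both look like copy-paste leftovers.
df_pivot.to_csv(save_path_sentiment, index=False)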

print(f"\n✅ File '{output_name}' generated successfully!")
print(f"📍 Location: {save_path_sentiment}")

✅ File 'political_sentiment_DETAILED.csv' generated successfully!
📍 Location: /Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/processed/political_sentiment_DETAILED.csv


The DP03_economic file holds our economic variables over the studied period. Since this dataset contains many variables, we first keep a subset of relevant ones (county FIPS code, county name, unemployment rate, median household income, poverty rate, share of public sector workers) and store them in a new dataframe. Next, we convert the economic indicators from object/string to numeric format to facilitate later analysis. We then extract the last five digits of the FIPS code so that it can be matched with our other datasets, which use a 5-digit FIPS code. We finish by exporting the clean dataset to our /data/processed folder.
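
A small sketch of the two cleaning steps described above (key extraction and numeric coercion). The income value and the "-" placeholder are made up for illustration; only the `0500000US` prefix and the column code come from the cell above.

```python
import pandas as pd

# Two toy rows shaped like the raw Census export (values are illustrative only).
toy = pd.DataFrame({
    "GEO_ID": ["0500000US01001", "0500000US01003"],
    "DP03_0062E": ["68315", "-"],  # "-" stands in for a non-numeric placeholder
})

# Keep the last 5 characters: 2-digit state code + 3-digit county code.
toy["fips"] = toy["GEO_ID"].str[-5:]

# Non-numeric strings become NaN instead of raising an error.
toy["median_income"] = pd.to_numeric(toy["DP03_0062E"], errors="coerce")

# Counting the NaNs shows how many values were lost to coercion.
n_missing = int(toy["median_income"].isna().sum())
print(toy[["fips", "median_income"]])
print(f"Values coerced to NaN: {n_missing}")
```

Checking the NaN count after coercion is a cheap way to notice when a column contains more placeholders than expected.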


Context: This script processes the DP02_socio.csv file within the 02_cleaning.ipynb notebook. It isolates one social indicator, educational attainment, for use in the predictive model.


In [20]:
import pandas as pd

# ==========================================
# STEP 1: LOAD SOCIAL DATA (DP02)
# ==========================================
# Define the file path for the raw social characteristics dataset (DP02).
# Note: Ensure the path is correctly directed to the local machine environment.
file_path_socio = "/Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/raw/socio-economic/DP02_socio.csv"

# Load the dataset with the header located at the first row (index 0).
df_socio = pd.read_csv(file_path_socio, header=0)

# Exclude the second row (index 0 after load), which contains descriptive metadata 
# rather than statistical observations.
df_socio = df_socio.iloc[1:] 

print(f"Social Data Loaded. Dimensions: {df_socio.shape}")

# ==========================================
# STEP 2: VARIABLE SELECTION (EDUCATION)
# ==========================================
# Feature Selection: Focus on educational attainment as a predictor.
# Variable 'DP02_0068PE' represents the percentage of the population aged 25 years 
# and over who hold a Bachelor's degree or higher.

columns_mapping_socio = {
    'GEO_ID': 'fips',
    'DP02_0068PE': 'education_bachelors_pct'
}

# Create a subset dataframe with the selected education variable and renamed columns.
df_socio_clean = df_socio[columns_mapping_socio.keys()].rename(columns=columns_mapping_socio).copy()

# ==========================================
# STEP 3: DATA CLEANING & STANDARDIZATION
# ==========================================
# Data Type Conversion: Transform the education percentage column from string to numeric 
# format to enable quantitative analysis. Invalid parsing will result in NaN.
df_socio_clean['education_bachelors_pct'] = pd.to_numeric(df_socio_clean['education_bachelors_pct'], errors='coerce')

# Primary Key Standardization: Extract the last 5 characters of the 'fips' code 
# to remove the '0500000US' prefix and ensure consistency with the election dataset.
df_socio_clean['fips'] = df_socio_clean['fips'].astype(str).str[-5:]





from pathlib import Path

# ==========================================
# STEP 4: EXPORT PROCESSED DATA
# ==========================================
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent
DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'
FIG_DIR = PROJECT_ROOT / 'figures'


# 2. Safety check: create the folder if it does not exist
# parents=True also creates "data" if it is missing
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# 3. Build the full output path
output_filename_socio = "education_cleaned_2023.csv"
save_path = PROCESSED_DIR / output_filename_socio

# 4. Save the CSV file
df_socio_clean.to_csv(save_path, index=False)

print(f"Process Complete. File saved: {save_path}")
print(df_socio_clean.head())



Social Data Loaded. Dimensions: (3222, 619)
Process Complete. File saved: /Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/processed/education_cleaned_2023.csv
    fips  education_bachelors_pct
1  01001                     28.3
2  01003                     32.8
3  01005                     11.5
4  01007                     11.5
5  01009                     15.6


The DP02_socio file holds our social variables over the studied period. Since this dataset contains many variables, we first restrict it to education: the percentage of the population aged 25 years and over who hold a bachelor's degree or higher. Next, we convert that variable from object/string to numeric format to facilitate later analysis. We then extract the last five digits of the FIPS code so that it can be matched with our other datasets, which use a 5-digit FIPS code. We finish by exporting the clean dataset to our /data/processed folder.
For example, in county 01001, 28.3% of the population aged 25 years and over holds a bachelor's degree or higher.
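
The cells above load the file and then drop the descriptive second row with `iloc[1:]`, which leaves every column typed as string. A possible alternative, sketched here on an in-memory CSV, is to skip that row at read time so numeric columns are parsed directly. The label row text is invented for the example; the two data rows reuse values shown in the output above.

```python
import io

import pandas as pd

# In-memory stand-in for the raw file: header, descriptive label row, then data.
raw = io.StringIO(
    'GEO_ID,NAME,DP02_0068PE\n'
    'Geography,Geographic Area Name,Label for DP02_0068PE\n'
    '0500000US01001,"Autauga County, Alabama",28.3\n'
    '0500000US01003,"Baldwin County, Alabama",32.8\n'
)

# skiprows=[1] drops the second physical line (the label row);
# dtype keeps GEO_ID as text so leading zeros survive.
df = pd.read_csv(raw, skiprows=[1], dtype={"GEO_ID": str})

print(df.dtypes)
```

With the label row gone, the percentage column is read as a float, so the later `pd.to_numeric` call would only be needed for columns that contain non-numeric placeholders.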

Context: This script processes the DP05_demo.csv file within the 02_cleaning.ipynb notebook. It is dedicated to preparing demographic features, specifically age and ethnicity, which are critical components of the socio-economic model.

In [21]:
import pandas as pd

# ==========================================
# STEP 1: LOAD DEMOGRAPHIC DATA (DP05)
# ==========================================
# Define the file path for the raw demographic dataset (DP05).
# Note: This path references the local environment structure.
file_path_demo = "/Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/raw/socio-economic/DP05_demo.csv"

# Load the dataset using the first row as the header.
df_demo = pd.read_csv(file_path_demo, header=0)

# Remove the description row (index 0) to retain only statistical observations.
df_demo = df_demo.iloc[1:]

print(f"Demographic Data Loaded. Dimensions: {df_demo.shape}")

# ==========================================
# STEP 2: VARIABLE SELECTION
# ==========================================
# Selection of key demographic variables relevant to US political behavior analysis:
# - DP05_0018E: Median Age (Age correlates with political spectrum positioning).
# - DP05_0037PE: Percentage White population (White alone).
# - DP05_0038PE: Percentage Black population (Black or African American alone).
# - DP05_0071PE: Percentage Hispanic population (Hispanic or Latino).

columns_mapping_demo = {
    'GEO_ID': 'fips',
    'DP05_0018E': 'median_age',
    'DP05_0037PE': 'white_pct',
    'DP05_0038PE': 'black_pct',
    'DP05_0071PE': 'hispanic_pct'
}

# Subset the dataframe to selected variables and rename columns for consistency.
df_demo_clean = df_demo[columns_mapping_demo.keys()].rename(columns=columns_mapping_demo).copy()

# ==========================================
# STEP 3: DATA CLEANING & STANDARDIZATION
# ==========================================
# Convert demographic indicators to numeric types to facilitate quantitative analysis.
# Errors are coerced to NaN to handle non-numeric artifacts.
cols_to_convert = ['median_age', 'white_pct', 'black_pct', 'hispanic_pct']

for col in cols_to_convert:
    df_demo_clean[col] = pd.to_numeric(df_demo_clean[col], errors='coerce')

# Standardize FIPS codes by extracting the last 5 digits to match the election dataset format.
df_demo_clean['fips'] = df_demo_clean['fips'].astype(str).str[-5:]

from pathlib import Path

# ==========================================
# STEP 4: EXPORT PROCESSED DATA (Demographics)
# ==========================================

# 1. Path configuration (standardized across cells)
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent  # Go up one level (assumes the notebook lives in src/ or notebooks/)
DATA_DIR = PROJECT_ROOT / 'data'
PROCESSED_DIR = DATA_DIR / 'processed'

# 2. Safety check: create the folder if it does not exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# 3. Build the full output path
output_filename_demo = "demographics_cleaned_2023.csv"
save_path_demo = PROCESSED_DIR / output_filename_demo

# 4. Save the CSV file
df_demo_clean.to_csv(save_path_demo, index=False)

print(f"Processing Complete. File saved: {save_path_demo}")
print(df_demo_clean.head())

Demographic Data Loaded. Dimensions: (3222, 379)
Processing Complete. File saved: /Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/processed/demographics_cleaned_2023.csv
    fips  median_age  white_pct  black_pct  hispanic_pct
1  01001        39.2       73.6       20.0           0.8
2  01003        43.7       82.8        8.0           1.6
3  01005        40.7       44.0       46.9           0.9
4  01007        41.3       75.1       20.7           1.9
5  01009        40.9       89.5        1.3           1.7



The DP05_demo file holds our demographic variables over the studied period. Since this dataset contains many variables, we first select the relevant ones: median age, percentage White population (White alone), percentage Black population (Black or African American alone), and percentage Hispanic population (Hispanic or Latino). Next, we convert those variables from object/string to numeric format to facilitate later analysis. We then extract the last five digits of the FIPS code so that it can be matched with our other datasets, which use a 5-digit FIPS code. We finish by exporting the clean dataset to our /data/processed folder.
For example, in county 01001 the median age is 39.2 years, 73.6% of the population is White, 20.0% is Black, and 0.8% is Hispanic.
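
Before exporting, a quick sanity check on the cleaned columns can catch coercion or key problems early. The sketch below reuses the first three rows printed above; the two checks (5-character FIPS, percentages between 0 and 100) are suggestions rather than part of the original pipeline.

```python
import pandas as pd

# First three rows of the cleaned demographic table, as printed above.
demo = pd.DataFrame({
    "fips": ["01001", "01003", "01005"],
    "white_pct": [73.6, 82.8, 44.0],
    "black_pct": [20.0, 8.0, 46.9],
    "hispanic_pct": [0.8, 1.6, 0.9],
})
pct_cols = ["white_pct", "black_pct", "hispanic_pct"]

# Rows whose key is not exactly 5 characters long.
bad_fips = demo[demo["fips"].str.len() != 5]

# Rows where any percentage falls outside [0, 100] (NaN values are not flagged).
out_of_range = demo[((demo[pct_cols] < 0) | (demo[pct_cols] > 100)).any(axis=1)]

print(f"Malformed FIPS rows: {len(bad_fips)}")
print(f"Out-of-range percentage rows: {len(out_of_range)}")
```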

Context: This script represents the culmination of the data preprocessing pipeline in 02_cleaning.ipynb. Its purpose is to aggregate the disparate cleaned datasets (election results, economic indicators, educational attainment, and demographic profiles) into a single, cohesive Master Table suitable for machine learning.

In [22]:
import pandas as pd

# ==========================================
# STEP 1: LOAD PROCESSED DATA
# ==========================================


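# Load all datasets into DataFrames.
# path_votes, path_eco, path_edu and path_demo are not defined in this cell;
# they must already exist in the kernel session.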
df_votes = pd.read_csv(path_votes)
df_eco = pd.read_csv(path_eco)
df_edu = pd.read_csv(path_edu)
df_demo = pd.read_csv(path_demo)

# Standardization: Rename the primary key in the votes dataframe to match the socio-economic files.
df_votes = df_votes.rename(columns={'county_fips': 'fips'})

# ==========================================
# 🚨 STEP 1.5: FIPS STANDARDIZATION & CLEANING (CRITICAL)
# ==========================================
def clean_fips(df):
    """
    Standardizes the Federal Information Processing Standard (FIPS) codes
    to ensure consistent merging keys across all datasets.
    Returns a new DataFrame; the input is left unchanged.
    """
    # Work on a copy so the caller's DataFrame is not modified and pandas
    # does not emit SettingWithCopyWarning on the assignments below.
    df = df.copy()

    # 1. Coerce to numeric to handle potential float representations (e.g., "1001.0")
    df['fips'] = pd.to_numeric(df['fips'], errors='coerce')

    # 2. Remove observations where the FIPS code is missing or invalid
    df = df.dropna(subset=['fips']).copy()

    # 3. Convert to integer to eliminate decimal points (e.g., 1001.0 -> 1001),
    #    then to a zero-padded 5-character string (e.g., 1001 -> "01001")
    df['fips'] = df['fips'].astype(int).astype(str).str.zfill(5)

    return df

print("--- Standardizing FIPS codes... ---")
df_votes = clean_fips(df_votes)
df_eco   = clean_fips(df_eco)
df_edu   = clean_fips(df_edu)
df_demo  = clean_fips(df_demo)

print(f"✅ Validation: Votes FIPS Example (Expected: 01001) : '{df_votes['fips'].iloc[0]}'")
print(f"✅ Validation: Eco FIPS Example   (Expected: 01001) : '{df_eco['fips'].iloc[0]}'")

# If both outputs confirm '01001' format, the subsequent merge operations will succeed.

# ==========================================
# STEP 2: DATA FUSION (MERGE)
# ==========================================
print("\nMerging datasets...")
# Perform an INNER JOIN between Votes and Economics.
# This intersection ensures we only retain counties present in both critical datasets.
df_master = pd.merge(df_votes, df_eco, on='fips', how='inner') 

# Perform LEFT JOINS for Education and Demographics to preserve the main structure
# even if minor auxiliary data is missing for some counties.
df_master = pd.merge(df_master, df_edu, on='fips', how='left')
df_master = pd.merge(df_master, df_demo, on='fips', how='left')

# ==========================================
# STEP 3: COLUMN CLEANING
# ==========================================
# Identify and remove redundant columns generated during the merge (suffixes like '_y').
# Also remove duplicate identifier columns (e.g., 'county_name_eco').
cols_to_drop = [c for c in df_master.columns if c.endswith('_y') or 'county_name_eco' in c]
df_master = df_master.drop(columns=cols_to_drop)

# Rename columns to remove the '_x' suffix generated by the merge.
# The pattern is anchored with '$' so only a trailing '_x' is stripped.
df_master.columns = df_master.columns.str.replace(r'_x$', '', regex=True)

from pathlib import Path

# ==========================================
# STEP 4: FINAL EXPORT (Master Table)
# ==========================================

# 1. Path configuration
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent
DATA_DIR = PROJECT_ROOT / 'data'
PROCESSED_DIR = DATA_DIR / 'processed'

# 2. Safety check: create the folder if needed
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

# 3. Display the dimensions
print(f"\n📊 Final Master Table Dimensions: {len(df_master)} rows.")

# 4. Conditional save logic
if len(df_master) > 0:
    # Build the full output path
    output_filename = "master_table_elections.csv"
    save_path_master = PROCESSED_DIR / output_filename
    
    # Export
    df_master.to_csv(save_path_master, index=False)
    
    # Success message with the exact path
    print(f"🎉 SAVED! The master file has been populated here: {save_path_master}")
    print(df_master.head())
else:
    print("❌ ERROR: Dataset is empty. Deep investigation required.")

--- Standardizing FIPS codes... ---
✅ Validation: Votes FIPS Example (Expected: 01001) : '02001'
✅ Validation: Eco FIPS Example   (Expected: 01001) : '01001'

Merging datasets...

📊 Final Master Table Dimensions: 21742 rows.
🎉 SAVED! The master file has been populated here: /Users/jessicabourdouxhe/Desktop/Master 1/Data/Projet /elections-nlp-project/data/processed/master_table_elections.csv
   year state_po  county_name   fips  DEMOCRAT  REPUBLICAN  unemployment_rate  \
0  2000       AK  DISTRICT 13  02013  0.335012    0.485081                3.8   
1  2000       AK  DISTRICT 16  02016  0.420277    0.422625                4.1   
2  2000       AK  DISTRICT 20  02020  0.324429    0.523912                4.6   
3  2000       AL      AUTAUGA  01001  0.287192    0.696943                2.5   
4  2000       AL      BALDWIN  01003  0.247822    0.723654                3.2   

   median_income  poverty_rate  public_workers_pct  education_bachelors_pct  \
0        72692.0          12.4

After cleaning each individual dataset, we merge the economic, education and demographic variables with the election results into a single dataset, which is the purpose of the code above. We export the merged dataset to our data/processed folder. This dataset is now ready for ML analysis.
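
Because the first merge is an inner join, counties missing from the economic file disappear silently. One way to see which keys fail to match is a diagnostic left join with pandas' `indicator` and `validate` options, sketched below on toy frames. The FIPS code "99999" is a made-up unmatched key; the two unemployment values reuse the output shown above.

```python
import pandas as pd

# Toy frames: the votes table has one county that is absent from the economic table.
votes = pd.DataFrame({
    "fips": ["01001", "01003", "99999"],  # "99999" is a hypothetical unmatched key
    "year": [2000, 2000, 2000],
})
eco = pd.DataFrame({
    "fips": ["01001", "01003"],
    "unemployment_rate": [2.5, 3.2],
})

# validate="many_to_one" raises if 'fips' is duplicated in the right-hand table;
# indicator=True adds a '_merge' column recording where each row was found.
check = votes.merge(eco, on="fips", how="left", validate="many_to_one", indicator=True)

unmatched = check.loc[check["_merge"] == "left_only", "fips"].tolist()
print(f"FIPS codes with no economic data: {unmatched}")
```

Running such a check before the real inner join gives a count of the rows that are about to be dropped, which makes the final row total easier to interpret.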