Context: This script constitutes the initial data preprocessing phase within the 02_cleaning.ipynb notebook. The primary objective is to transform raw election data into a structured format suitable for the subsequent predictive modeling of U.S. elections.

In [1]:
import pandas as pd
import numpy as np
from pathlib import Path

# ==========================================
# 0. PATH SETUP
# ==========================================
# Define the project root dynamically, as in 01_data_collection
current_dir = Path.cwd()
PROJECT_ROOT = current_dir.parent
DATA_DIR = PROJECT_ROOT / 'data'
RAW_DIR = DATA_DIR / 'raw'
PROCESSED_DIR = DATA_DIR / 'processed'

# Create the processed folder if it does not exist
PROCESSED_DIR.mkdir(parents=True, exist_ok=True)

print(f"📂 Project Root: {PROJECT_ROOT}")
print(f"📂 Raw Data Dir: {RAW_DIR}")
print(f"📂 Processed Data Dir: {PROCESSED_DIR}")

# ==========================================
# STEP 1: DATA LOADING (VOTES)
# ==========================================
# Use the dynamic path
file_path_votes = RAW_DIR / 'election_results' / 'countypres_2000-2024.csv'

# Fail early with a clear message if the file is missing
if not file_path_votes.exists():
    raise FileNotFoundError(f"❌ File not found at: {file_path_votes}")

df_raw = pd.read_csv(file_path_votes, sep=None, engine='python')

print("\n✅ File successfully loaded.")
print(f"Detected columns: {df_raw.columns.tolist()}")
print(f"Initial dataset shape: {df_raw.shape}")

# ==========================================
# STEP 2: DATA CLEANING & FILTERING
# ==========================================
df_clean = df_raw[df_raw['party'].isin(['DEMOCRAT', 'REPUBLICAN'])].copy()
df_clean['vote_share'] = df_clean['candidatevotes'] / df_clean['totalvotes']

# ==========================================
# STEP 3: DATA RESHAPING (PIVOTING)
# ==========================================
df_pivot = df_clean.pivot_table(
    index=['year', 'state_po', 'county_name', 'county_fips'],
    columns='party',
    values='vote_share'
).reset_index()

df_pivot.columns.name = None
df_pivot = df_pivot.fillna(0)

# ==========================================
# STEP 4: EXPORT & VALIDATION
# ==========================================
output_filename = "votes_cleaned_2000_2024.csv"
save_path_votes = PROCESSED_DIR / output_filename

df_pivot.to_csv(save_path_votes, index=False)

print(f"Processing complete. Data saved to '{save_path_votes}'")
print("\nFirst 5 rows of the cleaned dataset:")
print(df_pivot.head())

📂 Project Root: /Users/xavierfoidart/Documents/M1/Data Management/Projet/elections-nlp-project
📂 Raw Data Dir: /Users/xavierfoidart/Documents/M1/Data Management/Projet/elections-nlp-project/data/raw
📂 Processed Data Dir: /Users/xavierfoidart/Documents/M1/Data Management/Projet/elections-nlp-project/data/processed

✅ File successfully loaded.
Detected columns: ['year', 'state', 'state_po', 'county_name', 'county_fips', 'office', 'candidate', 'party', 'candidatevotes', 'totalvotes', 'version', 'mode']
Initial dataset shape: (94409, 12)
Processing complete. Data saved to '/Users/xavierfoidart/Documents/M1/Data Management/Projet/elections-nlp-project/data/processed/votes_cleaned_2000_2024.csv'

First 5 rows of the cleaned dataset:
   year state_po  county_name  county_fips  DEMOCRAT  REPUBLICAN
0  2000       AK   DISTRICT 1       2001.0  0.192909    0.703275
1  2000       AK  DISTRICT 10       2010.0  0.246961    0.638564
2  2000       AK  DISTRICT 11       2011.0  0.292693    0

We clean and filter the dataset to keep only the two major political parties (Democrats and Republicans). We compute each party's vote share (candidate votes divided by total votes) to normalize the vote counts. We then reshape the dataset from one observation per candidate to one observation per county and year. We impute missing values with 0 to account for counties where a specific party received no recorded votes. Finally, we export the cleaned dataset "votes_cleaned_2000_2024.csv" to the /data/processed folder.
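
The steps above can be sketched on a toy table. The rows below are invented, and the sketch assumes that some counties may be reported split by `mode` (the source only shows that a `mode` column exists) and that `totalvotes` repeats the county total on every row. Under those assumptions, votes are summed per party before dividing, because `pivot_table` averages duplicate entries by default, which would average per-mode shares instead of adding them. This is a variation on the cell above, not the notebook's own code.

```python
import pandas as pd

# Toy rows mimicking the raw columns; all values are invented for illustration
toy = pd.DataFrame({
    "year": [2020] * 5,
    "state_po": ["XX"] * 5,
    "county_name": ["ALPHA"] * 5,
    "county_fips": [1001.0] * 5,
    "party": ["DEMOCRAT", "DEMOCRAT", "REPUBLICAN", "REPUBLICAN", "GREEN"],
    "mode": ["EARLY", "ELECTION DAY", "EARLY", "ELECTION DAY", "TOTAL"],
    "candidatevotes": [30, 10, 20, 30, 10],
    "totalvotes": [100] * 5,
})

keys = ["year", "state_po", "county_name", "county_fips"]
major = toy[toy["party"].isin(["DEMOCRAT", "REPUBLICAN"])]

# Sum votes over reporting modes first, then divide by the county total
# (assumption: totalvotes is the same county-wide total on every row)
votes = major.groupby(keys + ["party"], as_index=False).agg(
    candidatevotes=("candidatevotes", "sum"),
    totalvotes=("totalvotes", "first"),
)
votes["vote_share"] = votes["candidatevotes"] / votes["totalvotes"]

# One row per county-year, one column per party
wide = votes.pivot(index=keys, columns="party", values="vote_share").reset_index()
wide.columns.name = None
print(wide)
```

If the raw file never splits a county across several rows per party, this reduces to the same result as the original cell.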

Context: This script processes the DP03_economic.csv file within the 02_cleaning.ipynb notebook. It focuses on preparing socio-economic predictors that will serve as independent variables in the predictive model.

In [2]:
# ==========================================
# STEP 1: LOAD DATA (ECONOMICS)
# ==========================================
# Dynamic path to the socio-economic folder
file_path_eco = RAW_DIR / 'socio-economic' / 'DP03_economic.csv'

df_eco = pd.read_csv(file_path_eco, header=0)
df_eco = df_eco.iloc[1:] # Remove description row

# ==========================================
# STEP 2: VARIABLE SELECTION
# ==========================================
columns_mapping = {
    'GEO_ID': 'fips',
    'NAME': 'county_name',
    'DP03_0009PE': 'unemployment_rate',
    'DP03_0062E': 'median_income',
    'DP03_0128PE': 'poverty_rate',
    'DP03_0043PE': 'public_workers_pct'
}

df_eco_clean = df_eco[columns_mapping.keys()].rename(columns=columns_mapping).copy()

# ==========================================
# STEP 3: DATA CLEANING
# ==========================================
cols_to_convert = ['unemployment_rate', 'median_income', 'poverty_rate', 'public_workers_pct']
for col in cols_to_convert:
    df_eco_clean[col] = pd.to_numeric(df_eco_clean[col], errors='coerce')

# Standardize FIPS (last 5 digits)
df_eco_clean['fips'] = df_eco_clean['fips'].astype(str).str[-5:]

# ==========================================
# STEP 4: EXPORT PROCESSED DATA
# ==========================================
# Note: despite its name, this file contains the cleaned economic indicators
output_name_eco = 'political_sentiment_DETAILED.csv'
save_path_sentiment = PROCESSED_DIR / output_name_eco

df_eco_clean.to_csv(save_path_sentiment, index=False)

print(f"\n✅ File '{output_name_eco}' generated successfully!")
print(f"📍 Location: {save_path_sentiment}")


✅ File 'political_sentiment_DETAILED.csv' generated successfully!
📍 Location: /Users/xavierfoidart/Documents/M1/Data Management/Projet/elections-nlp-project/data/processed/political_sentiment_DETAILED.csv




Our DP03_economic file holds our economic variables over the studied period. Since this dataset contains many variables, we first keep a subset of relevant ones (county FIPS code, county name, unemployment rate, median income, poverty rate, share of public workers). We therefore create a new dataframe containing only the selected variables. Next, we convert the economic indicators from object/string to numeric format to facilitate later analysis; values that cannot be parsed become NaN. We then extract the last five digits of the FIPS code so that it can be matched with our other datasets, which use a 5-digit FIPS code. We finish by exporting the cleaned dataset to our /data/processed folder.
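
The FIPS handling described above can be illustrated with a small round trip. The `GEO_ID` values below are hypothetical examples in the Census "0500000US" plus 5-digit county layout, which is an assumption since the source does not show the raw identifiers. The sketch shows that slicing the string keeps leading zeros, while reading the exported CSV back with default settings turns the key into an integer unless a string dtype is requested.

```python
import io
import pandas as pd

# Hypothetical GEO_ID values (assumed format: "0500000US" + 5-digit county code)
df = pd.DataFrame({"GEO_ID": ["0500000US01001", "0500000US02013"]})
df["fips"] = df["GEO_ID"].astype(str).str[-5:]

# Round-trip through CSV, as the notebook does between cells
buffer = io.StringIO()
df[["fips"]].to_csv(buffer, index=False)

buffer.seek(0)
fips_default = pd.read_csv(buffer)["fips"]  # dtype inferred as integer
buffer.seek(0)
fips_as_text = pd.read_csv(buffer, dtype={"fips": str})["fips"]  # kept as text

print(fips_default.tolist())
print(fips_as_text.tolist())
```

Passing `dtype={"fips": str}` when reloading is one way to keep the key intact; re-padding with `zfill(5)` after loading, as the final merge cell does, is another.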


Context: This script processes the DP02_socio.csv file within the 02_cleaning.ipynb notebook. It isolates one social indicator, educational attainment, for use in the predictive model.


In [3]:
# ==========================================
# STEP 1: LOAD SOCIAL DATA (DP02)
# ==========================================
file_path_socio = RAW_DIR / 'socio-economic' / 'DP02_socio.csv'

df_socio = pd.read_csv(file_path_socio, header=0)
df_socio = df_socio.iloc[1:]

print(f"Social Data Loaded. Dimensions: {df_socio.shape}")

# ==========================================
# STEP 2: VARIABLE SELECTION (EDUCATION)
# ==========================================
columns_mapping_socio = {
    'GEO_ID': 'fips',
    'DP02_0068PE': 'education_bachelors_pct'
}

df_socio_clean = df_socio[columns_mapping_socio.keys()].rename(columns=columns_mapping_socio).copy()

# ==========================================
# STEP 3: CLEANING
# ==========================================
df_socio_clean['education_bachelors_pct'] = pd.to_numeric(df_socio_clean['education_bachelors_pct'], errors='coerce')
df_socio_clean['fips'] = df_socio_clean['fips'].astype(str).str[-5:]

# ==========================================
# STEP 4: EXPORT
# ==========================================
output_filename_socio = "education_cleaned_2023.csv"
save_path_edu = PROCESSED_DIR / output_filename_socio

df_socio_clean.to_csv(save_path_edu, index=False)

print(f"Process Complete. File saved: {save_path_edu}")
print(df_socio_clean.head())


Social Data Loaded. Dimensions: (3222, 619)
Process Complete. File saved: /Users/xavierfoidart/Documents/M1/Data Management/Projet/elections-nlp-project/data/processed/education_cleaned_2023.csv
    fips  education_bachelors_pct
1  01001                     28.3
2  01003                     32.8
3  01005                     11.5
4  01007                     11.5
5  01009                     15.6




Our DP02_socio file holds our social variables over the studied period. Since this dataset contains many variables, we first decide to focus strictly on education, measured as the percentage of the population aged 25 years and over who hold a Bachelor's degree or higher. Next, we convert that variable from object/string to numeric format to facilitate later analysis. We then extract the last five digits of the FIPS code so that it can be matched with our other datasets, which use a 5-digit FIPS code. We finish by exporting the cleaned dataset to our /data/processed folder.
As an example, in county 01001, 28.3% of the population aged 25 years and over holds a Bachelor's degree or higher.
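
Because `errors='coerce'` silently turns unparseable entries into NaN, it can help to count what was lost during the numeric conversion. The sketch below is one possible check; the placeholder strings "-" and "(X)" are hypothetical examples of non-numeric entries, not values confirmed to be in DP02_socio.csv.

```python
import pandas as pd

# Invented sample: three valid percentages and two hypothetical placeholders
raw = pd.Series(["28.3", "32.8", "-", "(X)", "11.5"], name="education_bachelors_pct")
numeric = pd.to_numeric(raw, errors="coerce")

# Entries that were present in the raw data but became NaN after coercion
coerced_mask = numeric.isna() & raw.notna()
n_coerced = int(coerced_mask.sum())

print(f"{n_coerced} value(s) could not be parsed")
print(raw[coerced_mask].value_counts())
```

Listing the offending strings makes it easier to decide whether they are true missing values or a formatting issue worth handling explicitly.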

Context: This script processes the DP05_demo.csv file within the 02_cleaning.ipynb notebook. It is dedicated to preparing demographic features, specifically age and ethnicity, which are critical components of the socio-economic model.

In [4]:
# ==========================================
# STEP 1: LOAD DEMOGRAPHIC DATA (DP05)
# ==========================================
file_path_demo = RAW_DIR / 'socio-economic' / 'DP05_demo.csv'

df_demo = pd.read_csv(file_path_demo, header=0)
df_demo = df_demo.iloc[1:]

print(f"Demographic Data Loaded. Dimensions: {df_demo.shape}")

# ==========================================
# STEP 2: VARIABLE SELECTION
# ==========================================
columns_mapping_demo = {
    'GEO_ID': 'fips',
    'DP05_0018E': 'median_age',
    'DP05_0037PE': 'white_pct',
    'DP05_0038PE': 'black_pct',
    'DP05_0071PE': 'hispanic_pct'
}

df_demo_clean = df_demo[columns_mapping_demo.keys()].rename(columns=columns_mapping_demo).copy()

# ==========================================
# STEP 3: CLEANING
# ==========================================
cols_to_convert = ['median_age', 'white_pct', 'black_pct', 'hispanic_pct']
for col in cols_to_convert:
    df_demo_clean[col] = pd.to_numeric(df_demo_clean[col], errors='coerce')

df_demo_clean['fips'] = df_demo_clean['fips'].astype(str).str[-5:]

# ==========================================
# STEP 4: EXPORT
# ==========================================
output_filename_demo = "demographics_cleaned_2023.csv"
save_path_demo = PROCESSED_DIR / output_filename_demo

df_demo_clean.to_csv(save_path_demo, index=False)

print(f"Processing Complete. File saved: {save_path_demo}")
print(df_demo_clean.head())

Demographic Data Loaded. Dimensions: (3222, 379)
Processing Complete. File saved: /Users/xavierfoidart/Documents/M1/Data Management/Projet/elections-nlp-project/data/processed/demographics_cleaned_2023.csv
    fips  median_age  white_pct  black_pct  hispanic_pct
1  01001        39.2       73.6       20.0           0.8
2  01003        43.7       82.8        8.0           1.6
3  01005        40.7       44.0       46.9           0.9
4  01007        41.3       75.1       20.7           1.9
5  01009        40.9       89.5        1.3           1.7




Our DP05_demo file holds our demographic variables over the studied period. Since this dataset contains many variables, we first select the relevant ones: median age, percentage of White population, percentage of Black population (Black or African American alone), and percentage of Hispanic population (Hispanic or Latino). Next, we convert those variables from object/string to numeric format to facilitate later analysis. We then extract the last five digits of the FIPS code so that it can be matched with our other datasets, which use a 5-digit FIPS code. We finish by exporting the cleaned dataset to our /data/processed folder.
As an example, in county 01001 the median age is 39.2 years, 73.6% of the population is White, 20.0% is Black, and 0.8% is Hispanic.
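
The economic, social, and demographic cells repeat the same three steps (select and rename, convert to numeric, trim the FIPS code). As a possible refactoring, they could share one helper. The function name `clean_acs_table` and its `numeric_cols` parameter are hypothetical, and the toy input (including the `GEO_ID` format and the "(X)" placeholder) is invented for illustration.

```python
import pandas as pd

def clean_acs_table(df, mapping, numeric_cols):
    """Select and rename columns, coerce measures to numbers, keep a 5-character FIPS."""
    out = df[list(mapping)].rename(columns=mapping).copy()
    for col in numeric_cols:
        out[col] = pd.to_numeric(out[col], errors="coerce")
    out["fips"] = out["fips"].astype(str).str[-5:]
    return out

# Toy input; identifiers and values are illustrative only
toy = pd.DataFrame({
    "GEO_ID": ["0500000US01001", "0500000US01003"],
    "DP05_0018E": ["39.2", "43.7"],
    "DP05_0037PE": ["73.6", "(X)"],
    "UNUSED": ["a", "b"],
})

demo = clean_acs_table(
    toy,
    mapping={"GEO_ID": "fips", "DP05_0018E": "median_age", "DP05_0037PE": "white_pct"},
    numeric_cols=["median_age", "white_pct"],
)
print(demo)
```

Keeping `numeric_cols` explicit lets text columns such as `county_name` pass through untouched.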

Context: This script completes the data preprocessing pipeline in 02_cleaning.ipynb. Its purpose is to merge the separately cleaned datasets (election results, economic indicators, educational attainment, and demographic profiles) into a single Master Table suitable for machine learning.

In [5]:
# ==========================================
# STEP 1: LOAD PROCESSED DATA
# ==========================================
# Redefine the input paths from PROCESSED_DIR to be safe
path_votes_clean = PROCESSED_DIR / "votes_cleaned_2000_2024.csv"
path_eco_clean   = PROCESSED_DIR / "political_sentiment_DETAILED.csv"
path_edu_clean   = PROCESSED_DIR / "education_cleaned_2023.csv"
path_demo_clean  = PROCESSED_DIR / "demographics_cleaned_2023.csv"

# Load the cleaned files
df_votes = pd.read_csv(path_votes_clean)
df_eco   = pd.read_csv(path_eco_clean)
df_edu   = pd.read_csv(path_edu_clean)
df_demo  = pd.read_csv(path_demo_clean)

# Standardization: Rename key in votes
df_votes = df_votes.rename(columns={'county_fips': 'fips'})

# ==========================================
# 🚨 STEP 1.5: FIPS STANDARDIZATION & CLEANING
# ==========================================
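def clean_fips(df):
    """Standardizes FIPS to 5-digit string format."""
    # assign() returns a new DataFrame, so the caller's object is not mutated
    df = df.assign(fips=pd.to_numeric(df['fips'], errors='coerce'))
    # Drop rows without a usable FIPS; .copy() avoids SettingWithCopyWarning
    df = df.dropna(subset=['fips']).copy()
    # Cast to int to remove any ".0", then left-pad with zeros to 5 characters
    df['fips'] = df['fips'].astype(int).astype(str).str.zfill(5)
    return df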

print("--- Standardizing FIPS codes... ---")
df_votes = clean_fips(df_votes)
df_eco   = clean_fips(df_eco)
df_edu   = clean_fips(df_edu)
df_demo  = clean_fips(df_demo)

print(f"✅ Validation: Votes FIPS Example : '{df_votes['fips'].iloc[0]}'")
print(f"✅ Validation: Eco FIPS Example   : '{df_eco['fips'].iloc[0]}'")

# ==========================================
# STEP 2: DATA FUSION (MERGE)
# ==========================================
print("\nMerging datasets...")
# Inner Join: Votes + Eco (Must have both)
df_master = pd.merge(df_votes, df_eco, on='fips', how='inner') 

# Left Join: + Education + Demo (Keep observation even if data missing)
df_master = pd.merge(df_master, df_edu, on='fips', how='left')
df_master = pd.merge(df_master, df_demo, on='fips', how='left')

# ==========================================
# STEP 3: COLUMN CLEANING
# ==========================================
cols_to_drop = [c for c in df_master.columns if c.endswith('_y') or 'county_name_eco' in c]
df_master = df_master.drop(columns=cols_to_drop)
# Strip only a trailing '_x' merge suffix, not '_x' elsewhere in a name
df_master.columns = df_master.columns.str.replace(r'_x$', '', regex=True)

# ==========================================
# STEP 4: FINAL EXPORT (Master Table)
# ==========================================
output_filename_master = "master_table_elections.csv"
save_path_master = PROCESSED_DIR / output_filename_master

if len(df_master) > 0:
    df_master.to_csv(save_path_master, index=False)
    print(f"🎉 SAVED! The master file has been populated here: {save_path_master}")
    print(df_master.head())
else:
    print("❌ ERROR: Dataset is empty. Deep investigation required.")

--- Standardizing FIPS codes... ---
✅ Validation: Votes FIPS Example : '02001'
✅ Validation: Eco FIPS Example   : '01001'

Merging datasets...
🎉 SAVED! The master file has been populated here: /Users/xavierfoidart/Documents/M1/Data Management/Projet/elections-nlp-project/data/processed/master_table_elections.csv
   year state_po  county_name   fips  DEMOCRAT  REPUBLICAN  unemployment_rate  \
0  2000       AK  DISTRICT 13  02013  0.335012    0.485081                3.8   
1  2000       AK  DISTRICT 16  02016  0.420277    0.422625                4.1   
2  2000       AK  DISTRICT 20  02020  0.324429    0.523912                4.6   
3  2000       AL      AUTAUGA  01001  0.287192    0.696943                2.5   
4  2000       AL      BALDWIN  01003  0.247822    0.723654                3.2   

   median_income  poverty_rate  public_workers_pct  education_bachelors_pct  \
0        72692.0          12.4                 1.9                     18.1   
1       107344.0           9.2    

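After cleaning each individual dataset, we merge the social, demographic, and economic variables with the election results into a single dataset, which is the purpose of the code above. Votes are inner-joined with the economic data on the FIPS code, then education and demographics are left-joined so that observations are kept even when those variables are missing. We export the merged dataset to our data/processed folder, where it is ready for ML analysis.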
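
Since an inner join silently drops county-year rows without an economic match, a quick diagnostic before the final merge can show how many are affected. The sketch below uses pandas' `validate` and `indicator` options of `pd.merge` on invented toy tables; the FIPS code "99999" and the income figures are made up for illustration.

```python
import pandas as pd

# Toy tables; all values are invented
votes = pd.DataFrame({
    "fips": ["01001", "01001", "02013", "99999"],
    "year": [2000, 2004, 2000, 2000],
})
eco = pd.DataFrame({
    "fips": ["01001", "02013"],
    "median_income": [50000.0, 60000.0],
})

# validate raises MergeError if eco holds duplicate FIPS codes;
# indicator adds a _merge column telling where each row came from
check = pd.merge(
    votes, eco, on="fips", how="left",
    validate="many_to_one", indicator=True,
)

# FIPS codes present in the votes table but absent from the economic table
unmatched = check.loc[check["_merge"] == "left_only", "fips"].unique().tolist()
print(unmatched)
```

A left join with `indicator=True` keeps the unmatched rows visible, so they can be inspected before deciding whether an inner join is appropriate.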