# Data Preprocessing

The goal of this notebook is to prepare both the **historical dataset (2005–2023)** and the **2024 snapshot dataset** for further analysis and modeling.  
While the exploratory data analysis (EDA) provided initial descriptive insights, this stage enforces data consistency, resolves structural issues, and creates a clean foundation for answering our research questions.  

Key preprocessing objectives:
- Address missingness where relevant (e.g., for regression modeling)
- Harmonize country names and ensure consistent regional mapping across datasets  
- Verify variable selection and clarify which features will be used for historical vs. 2024 analyses    


## 1. Load Datasets

In [1]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [2]:
path_2024 = "../data/raw/world-happiness-report-2024-yearly-updated/World-happiness-report-2024.csv"
path_all_years = "../data/raw/world-happiness-report-2024-yearly-updated/World-happiness-report-updated_2024.csv"

# Load datasets
df_2024 = pd.read_csv(path_2024) 
df_all_years = pd.read_csv(path_all_years, encoding='latin1')

# Quick sanity check
print('df_2024 shape:', df_2024.shape)
print('df_all_years shape:', df_all_years.shape)

df_2024 shape: (143, 12)
df_all_years shape: (2363, 11)


## 2. Handling Missing Values & Data Cleaning  

As established in the EDA, missingness is **limited but patterned**:  
- 2005 has very few observations → excluded from historical analysis.  
- Three countries in the 2024 dataset (Bahrain, Tajikistan, Palestine) have missing predictors but valid ladder scores → retained for descriptive analyses, excluded from regression modeling.  

No imputation is applied: coverage is high, and because the missingness is patterned rather than random, the affected rows are excluded where necessary instead of being filled in.  
We now apply these cleaning steps directly to the datasets.


In [3]:
# Dropping 2005 data due to low coverage
df_all_years_clean = df_all_years[df_all_years['year'] != 2005].copy()

# Quick sanity check
print('df_all_years_clean shape (2005 removed):', df_all_years_clean.shape) # 27 records removed as expected

df_all_years_clean shape (2005 removed): (2336, 11)
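The low coverage of 2005 can also be detected programmatically rather than hard-coded. The following is a minimal sketch on a made-up frame; `toy_hist` and the threshold `min_countries` are assumptions for illustration and are not part of the notebook.

```python
import pandas as pd

# Toy stand-in for df_all_years; rows are made up for illustration only
toy_hist = pd.DataFrame({
    "Country name": ["A", "A", "B", "C", "A", "B", "C"],
    "year": [2005, 2006, 2006, 2006, 2007, 2007, 2007],
})

# Number of distinct countries observed in each year
countries_per_year = toy_hist.groupby("year")["Country name"].nunique()

# Hypothetical threshold: years with fewer countries count as sparse
min_countries = 2
sparse_years = countries_per_year[countries_per_year < min_countries].index.tolist()

# Keep only years with sufficient coverage
toy_hist_clean = toy_hist[~toy_hist["year"].isin(sparse_years)].copy()

print("Sparse years:", sparse_years)
print("Rows kept:", len(toy_hist_clean))
```

A threshold-based filter like this would make the exclusion rule explicit if further yearly updates were added later.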


In [4]:
# Drop 2024 rows that lack 'Log GDP per capita' (the three countries with missing predictors) for regression modeling
df_2024_model = df_2024.dropna(axis=0, subset=['Log GDP per capita']).copy()

# Quick sanity check
print('df_2024_model shape (NaN values removed):', df_2024_model.shape) # 3 rows dropped as expected

df_2024_model shape (NaN values removed): (140, 12)
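The cell above filters on a single column, which is sufficient here because the three affected countries all lack `Log GDP per capita`. A more defensive variant drops rows that miss any predictor. The sketch below uses a small made-up frame; `toy_2024` and `predictor_cols` are assumed names, and the values are not taken from the report.

```python
import numpy as np
import pandas as pd

# Toy stand-in for df_2024; values are made up for illustration only
toy_2024 = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Ladder score": [7.1, 5.9, 4.3, 6.0],
    "Log GDP per capita": [1.8, np.nan, 1.1, 1.5],
    "Social support": [1.5, np.nan, 1.0, np.nan],
})

# Hypothetical list of predictors intended for the regression
predictor_cols = ["Log GDP per capita", "Social support"]

# Rows with at least one missing predictor
rows_missing = toy_2024[toy_2024[predictor_cols].isna().any(axis=1)]
print("Incomplete rows:", rows_missing["Country name"].tolist())

# Drop on all predictors instead of on a single column
toy_2024_model = toy_2024.dropna(subset=predictor_cols).copy()
print("Model frame shape:", toy_2024_model.shape)
```

Listing the incomplete rows before dropping them makes it easy to confirm that only the expected countries are removed.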


## 3. Column Harmonization  

To ensure consistency between the historical dataset and the 2024 dataset, we standardize column names.  
This step prevents mismatches during analysis and guarantees that identical variables carry the same labels across datasets.  


In [5]:
# Rename columns in historical dataset to align with 2024 dataset
df_all_years_clean = df_all_years_clean.rename(columns={
    'Life Ladder': 'Ladder score',
    'Healthy life expectancy at birth': 'Healthy life expectancy',
})

# Quick check
print("Historical dataset columns:", df_all_years_clean.columns.tolist())
print("2024 dataset columns:", df_2024.columns.tolist())


Historical dataset columns: ['Country name', 'year', 'Ladder score', 'Log GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Positive affect', 'Negative affect']
2024 dataset columns: ['Country name', 'Regional indicator', 'Ladder score', 'upperwhisker', 'lowerwhisker', 'Log GDP per capita', 'Social support', 'Healthy life expectancy', 'Freedom to make life choices', 'Generosity', 'Perceptions of corruption', 'Dystopia + residual']
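Comparing the two column lists by eye is error-prone. As a small sketch, the lists printed above can be compared directly; the variable names `hist_cols`, `cols_2024`, `shared`, `only_hist`, and `only_2024` are assumptions introduced for this example.

```python
# Column lists copied from the printed output above
hist_cols = ['Country name', 'year', 'Ladder score', 'Log GDP per capita',
             'Social support', 'Healthy life expectancy',
             'Freedom to make life choices', 'Generosity',
             'Perceptions of corruption', 'Positive affect', 'Negative affect']
cols_2024 = ['Country name', 'Regional indicator', 'Ladder score',
             'upperwhisker', 'lowerwhisker', 'Log GDP per capita',
             'Social support', 'Healthy life expectancy',
             'Freedom to make life choices', 'Generosity',
             'Perceptions of corruption', 'Dystopia + residual']

# Preserve the original column order instead of using unordered sets
shared = [c for c in hist_cols if c in cols_2024]
only_hist = [c for c in hist_cols if c not in cols_2024]
only_2024 = [c for c in cols_2024 if c not in hist_cols]

print("Shared columns:", shared)
print("Historical only:", only_hist)
print("2024 only:", only_2024)
```

After the rename, the eight shared columns carry identical labels, while `year` and the two affect variables exist only in the historical data.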


## 4. Region Mapping

To enable regional analyses on the historical dataset, we map the **Regional indicator** from the 2024 dataset onto the historical data by country name.

In [6]:
# Match country names between datasets (for merging Regional indicator column later)

# Count unique countries in both datasets and identify non-matching entries
set_hist = set(df_all_years_clean['Country name'])
set_2024 = set(df_2024['Country name'])

len_hist, len_2024 = len(set_hist), len(set_2024)
print('Unique countries: hist=', len_hist, ' 2024=', len_2024)

only_in_hist = sorted(set_hist - set_2024)
only_in_2024 = sorted(set_2024 - set_hist)

print('In historical only:', only_in_hist)
print('In 2024 only:', only_in_2024)


Unique countries: hist= 165  2024= 143
In historical only: ['Angola', 'Belarus', 'Belize', 'Bhutan', 'Burundi', 'Central African Republic', 'Cuba', 'Djibouti', 'Guyana', 'Haiti', 'Maldives', 'Oman', 'Qatar', 'Rwanda', 'Somalia', 'Somaliland region', 'South Sudan', 'Sudan', 'Suriname', 'Syria', 'Trinidad and Tobago', 'Turkmenistan', 'Türkiye']
In 2024 only: ['Turkiye']
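The set difference surfaces the `Türkiye` / `Turkiye` mismatch by inspection. Spelling variants that differ only in diacritics could also be flagged automatically. The sketch below is one possible approach using the standard library; `strip_accents` is a hypothetical helper and not part of the notebook, and the input lists are shortened for illustration.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Return the name with combining diacritical marks removed."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Shortened versions of the lists printed above
only_in_hist = ["Angola", "Belarus", "Türkiye"]
only_in_2024 = ["Turkiye"]

# Index the 2024 names by their accent-free, case-folded form
lookup_2024 = {strip_accents(n).casefold(): n for n in only_in_2024}

# Historical names whose normalized form matches a 2024 name
candidates = {
    name: lookup_2024[strip_accents(name).casefold()]
    for name in only_in_hist
    if strip_accents(name).casefold() in lookup_2024
}

print(candidates)
```

The resulting dictionary has the same shape as the argument passed to `replace` in the next cell, so it could serve as a starting point for the rename mapping after manual review.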


In [7]:
# Harmonize single known mismatch in country naming (2024 uses "Turkiye" instead of "Türkiye")
df_all_years_clean['Country name'] = df_all_years_clean['Country name'].replace({'Türkiye': 'Turkiye'})

# Region mapping from 2024 dataset to historical dataset
region_map = df_2024.set_index('Country name')['Regional indicator'].to_dict()
df_all_years_clean['Regional indicator'] = df_all_years_clean['Country name'].map(region_map)

# Unmatched countries
missing_regions = df_all_years_clean[df_all_years_clean['Regional indicator'].isnull()]['Country name'].unique()
print("Countries without mapped region:", missing_regions)

# Turkiye is now matched correctly. There are 22 countries without region mapping, due to being absent in 2024 data. 

# Drop rows without region mapping for regional analyses
df_all_years_region = df_all_years_clean.dropna(subset=['Regional indicator']).copy()

# Quick sanity check
display(df_all_years_region.isnull().sum())  # Shows 0 NaN in 'Regional indicator' column as expected

Countries without mapped region: ['Angola' 'Belarus' 'Belize' 'Bhutan' 'Burundi' 'Central African Republic'
 'Cuba' 'Djibouti' 'Guyana' 'Haiti' 'Maldives' 'Oman' 'Qatar' 'Rwanda'
 'Somalia' 'Somaliland region' 'South Sudan' 'Sudan' 'Suriname' 'Syria'
 'Trinidad and Tobago' 'Turkmenistan']


Country name                      0
year                              0
Ladder score                      0
Log GDP per capita               19
Social support                    9
Healthy life expectancy          59
Freedom to make life choices     30
Generosity                       44
Perceptions of corruption       106
Positive affect                  17
Negative affect                  11
Regional indicator                0
dtype: int64

For the historical dataset, we mapped the **Regional indicator** from the 2024 dataset to enable consistent regional trend analyses.  

- 22 countries could not be matched because they are absent from the 2024 dataset and therefore lack a regional assignment.  
- Since our research questions focus on **regional trends** (RQ1 & RQ2), these countries are excluded from the regional analyses; they remain available in the unrestricted historical dataset.  
- This ensures a consistent country–region mapping across historical and 2024 datasets, even though it reduces coverage slightly.  

The region-level dataset (`df_all_years_region`) is therefore restricted to countries with a valid regional mapping.
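An alternative to the dictionary-based `map` is a left merge with an indicator column, which records for each row whether a region was found. The following is a minimal sketch on made-up data; the frames `toy_hist` and `toy_regions`, the region labels, and the ladder scores are placeholders, not values from the report.

```python
import pandas as pd

# Toy historical data; ladder scores are made up
toy_hist = pd.DataFrame({
    "Country name": ["Norway", "Norway", "Cuba", "Turkiye"],
    "year": [2022, 2023, 2006, 2023],
    "Ladder score": [7.3, 7.2, 5.4, 4.9],
})

# Toy country-to-region lookup with placeholder region labels
toy_regions = pd.DataFrame({
    "Country name": ["Norway", "Turkiye"],
    "Regional indicator": ["Region X", "Region Y"],
})

# validate raises if a country appears more than once in the lookup;
# indicator adds a '_merge' column marking matched and unmatched rows
merged = toy_hist.merge(
    toy_regions,
    on="Country name",
    how="left",
    validate="many_to_one",
    indicator=True,
)

unmatched = merged.loc[merged["_merge"] == "left_only", "Country name"].unique().tolist()
toy_hist_region = merged[merged["_merge"] == "both"].drop(columns="_merge")

print("Countries without mapped region:", unmatched)
print("Rows kept:", len(toy_hist_region), "of", len(merged))
```

The `validate` argument guards against duplicated country names in the lookup table, which a plain dictionary would silently overwrite.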

## 5. Save Processed Datasets

We now save the cleaned and harmonized datasets to the `data/processed/` folder.  
These files will be the input for subsequent analysis and modeling.


In [8]:
# Historical datasets
df_all_years_clean.to_csv("../data/processed/world-happiness-historical-all.csv", index=False)
df_all_years_region.to_csv("../data/processed/world-happiness-historical-region.csv", index=False)

# 2024 datasets
df_2024.to_csv("../data/processed/world-happiness-2024.csv", index=False)
df_2024_model.to_csv("../data/processed/world-happiness-2024-model.csv", index=False)

print("Processed datasets successfully saved.")


Processed datasets successfully saved.


We store multiple processed versions to allow flexibility in later analyses:
- **world-happiness-historical-all.csv** → full historical dataset (2006–2023; 2005 excluded, countries without a mapped region retained)  
- **world-happiness-historical-region.csv** → historical dataset restricted to countries with valid regional mapping  
- **world-happiness-2024.csv** → complete 2024 dataset  
- **world-happiness-2024-model.csv** → 2024 dataset cleaned for regression (3 rows with missing predictors dropped)  
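Before relying on the saved files, it can be worth confirming that a CSV round trip preserves the data. The sketch below writes a made-up frame to a temporary directory and reloads it; the file name `toy-processed.csv` and the frame `toy_processed` are assumptions for illustration, and the same check could be pointed at the files in `data/processed/`.

```python
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd

# Toy stand-in for a processed dataset; values are made up
toy_processed = pd.DataFrame({
    "Country name": ["A", "B"],
    "year": [2022, 2023],
    "Ladder score": [6.5, np.nan],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "toy-processed.csv"
    # index=False avoids writing the row index as an extra column
    toy_processed.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Raises an AssertionError if values, dtypes, or columns differ
pd.testing.assert_frame_equal(toy_processed, reloaded)
print("Round trip OK:", reloaded.shape)
```

Missing values are written as empty fields and read back as `NaN`, so the missingness pattern survives the export.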
