# Data Preprocessing

The goal of this notebook is to prepare both the **historical dataset (2005–2023)** and the **2024 snapshot dataset** for further analysis and modeling.  
While the exploratory data analysis (EDA) provided first descriptive insights, this stage ensures data consistency, handles structural issues, and creates a clean foundation for answering our research questions.  

Key preprocessing objectives:
- Address missingness where relevant (e.g., for regression modeling)
- Harmonize country names and ensure consistent regional mapping across datasets  
- Verify variable selection and clarify which features will be used for historical vs. 2024 analyses    


## 1. Load Datasets

In [23]:
import seaborn as sns
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

In [24]:
path_2024 = "../data/raw/world-happiness-report-2024-yearly-updated/World-happiness-report-2024.csv"
path_all_years = "../data/raw/world-happiness-report-2024-yearly-updated/World-happiness-report-updated_2024.csv"

# Load datasets
df_2024 = pd.read_csv(path_2024) 
df_all_years = pd.read_csv(path_all_years, encoding='latin1')

# Quick sanity check
print('df_2024 shape:', df_2024.shape)
print('df_all_years shape:', df_all_years.shape)

df_2024 shape: (143, 12)
df_all_years shape: (2363, 11)


## 2. Handling Missing Values & Data Cleaning  

As established in the EDA, missingness is **limited but patterned**:  
- 2005 has very few observations → excluded from historical analysis.  
- Three countries in the 2024 dataset (Bahrain, Tajikistan, Palestine) have missing predictors but valid ladder scores → retained for descriptive analyses, excluded from regression modeling.  

No imputation is applied: coverage is high, and because the missingness is not random, imputed values could bias the results.  
We now apply these cleaning steps directly to the datasets.


In [25]:
# Dropping 2005 data due to low coverage
df_all_years_clean = df_all_years[df_all_years['year'] != 2005].copy()

# Quick sanity check
print('df_all_years_clean shape (2005 removed):', df_all_years_clean.shape) # 27 records removed as expected

df_all_years_clean shape (2005 removed): (2336, 11)
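
The low coverage of 2005 can also be checked programmatically rather than hard-coded. The following is a minimal sketch on a toy frame, not the real data: the `toy_hist` values and the `min_countries` threshold are illustrative assumptions, and only the column names `Country name` and `year` come from the dataset.

```python
import pandas as pd

# Toy stand-in for the historical dataset (illustrative values, not real data)
toy_hist = pd.DataFrame({
    "Country name": ["A", "A", "B", "C", "A", "B", "C"],
    "year": [2005, 2006, 2006, 2006, 2007, 2007, 2007],
})

# Number of distinct countries observed in each year
countries_per_year = toy_hist.groupby("year")["Country name"].nunique()

# Hypothetical threshold: years with fewer countries count as sparse
min_countries = 2
sparse_years = countries_per_year[countries_per_year < min_countries].index.tolist()

# Drop the sparse years, mirroring the 2005 exclusion above
toy_clean = toy_hist[~toy_hist["year"].isin(sparse_years)].copy()

print("Sparse years:", sparse_years)
print("Rows kept:", len(toy_clean))
```

On the real data, the same `groupby` would be run on `df_all_years`, with a threshold chosen from the EDA.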


In [26]:
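# Drop 2024 rows with a missing 'Log GDP per capita' for regression modeling
# (assumes the three affected countries are the only rows with missing predictors)
df_2024_model = df_2024.dropna(axis=0, subset=['Log GDP per capita']).copy()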

# Quick sanity check
print('df_2024_model shape (NaN values removed):', df_2024_model.shape) # 3 rows dropped as expected

df_2024_model shape (NaN values removed): (140, 12)
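
Dropping on a single column only yields a complete modeling set if every row with a missing predictor is also missing `Log GDP per capita`. The sketch below shows one way to verify this on a toy frame. The values are illustrative, and every column name other than `Log GDP per capita` is an assumption about the 2024 file.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the 2024 snapshot (illustrative values, not real data)
toy_2024 = pd.DataFrame({
    "Country name": ["A", "B", "C", "D"],
    "Ladder score": [7.1, 5.4, 4.9, 6.0],
    "Log GDP per capita": [1.8, np.nan, 1.1, 1.5],
    "Social support": [1.5, np.nan, 1.0, np.nan],
})

# Assumed predictor list; adjust to the variables actually used in the model
predictors = ["Log GDP per capita", "Social support"]

# Missing values per predictor column
missing_per_column = toy_2024[predictors].isna().sum()

# Compare dropping on one column with dropping on all predictors
model_single = toy_2024.dropna(subset=["Log GDP per capita"])
model_all = toy_2024.dropna(subset=predictors)
same_rows = model_single.index.equals(model_all.index)

print(missing_per_column)
print("Single-column drop matches all-predictor drop:", same_rows)
```

In this toy example the two results differ because country D lacks only `Social support`. If the same check returns `True` on `df_2024`, the single-column drop above is sufficient.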


## 3. Region Mapping

To enable regional analyses of the historical dataset as well, we map the `Regional indicator` column from the 2024 dataset onto the historical data, matching on country names.

In [27]:
# Match country names between datasets (for merging Regional indicator column later)

# Count unique countries in both datasets and identify non-matching entries
set_hist = set(df_all_years_clean['Country name'])
set_2024 = set(df_2024['Country name'])

len_hist, len_2024 = len(set_hist), len(set_2024)
print('Unique countries: hist=', len_hist, ' 2024=', len_2024)

only_in_hist = sorted(set_hist - set_2024)
only_in_2024 = sorted(set_2024 - set_hist)

print('In historical only:', only_in_hist)
print('In 2024 only:', only_in_2024)


Unique countries: hist= 165  2024= 143
In historical only: ['Angola', 'Belarus', 'Belize', 'Bhutan', 'Burundi', 'Central African Republic', 'Cuba', 'Djibouti', 'Guyana', 'Haiti', 'Maldives', 'Oman', 'Qatar', 'Rwanda', 'Somalia', 'Somaliland region', 'South Sudan', 'Sudan', 'Suriname', 'Syria', 'Trinidad and Tobago', 'Turkmenistan', 'Türkiye']
In 2024 only: ['Turkiye']
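
Spelling variants such as 'Türkiye' versus 'Turkiye' can be flagged automatically by comparing accent-stripped names. This is a minimal sketch using only the standard library. The helper `strip_accents` is a hypothetical name introduced here, and the input lists are shortened copies of the output above.

```python
import unicodedata


def strip_accents(name: str) -> str:
    """Return the name with combining diacritical marks removed."""
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Shortened examples taken from the mismatch lists above
only_in_hist = ["Angola", "Belarus", "Türkiye"]
only_in_2024 = ["Turkiye"]

# Index the 2024-only names by their normalized form
lookup_2024 = {strip_accents(n).casefold(): n for n in only_in_2024}

# Pair up names that are equal after normalization
candidates = {
    name: lookup_2024[strip_accents(name).casefold()]
    for name in only_in_hist
    if strip_accents(name).casefold() in lookup_2024
}

print(candidates)
```

This only catches differences in accents and letter case. Genuine renames would still need a manual mapping.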


In [28]:
# Harmonize single known mismatch in country naming (2024 uses "Turkiye" instead of "Türkiye")
df_all_years_clean['Country name'] = df_all_years_clean['Country name'].replace({'Türkiye': 'Turkiye'})

# Region mapping from 2024 dataset to historical dataset
region_map = df_2024.set_index('Country name')['Regional indicator'].to_dict()
df_all_years_clean['Regional indicator'] = df_all_years_clean['Country name'].map(region_map)

# Unmatched countries
missing_regions = df_all_years_clean[df_all_years_clean['Regional indicator'].isnull()]['Country name'].unique()
print("Countries without mapped region:", missing_regions)

# Turkiye is now matched correctly. 22 countries remain without a region because they are absent from the 2024 data.

Countries without mapped region: ['Angola' 'Belarus' 'Belize' 'Bhutan' 'Burundi' 'Central African Republic'
 'Cuba' 'Djibouti' 'Guyana' 'Haiti' 'Maldives' 'Oman' 'Qatar' 'Rwanda'
 'Somalia' 'Somaliland region' 'South Sudan' 'Sudan' 'Suriname' 'Syria'
 'Trinidad and Tobago' 'Turkmenistan']
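
As an alternative to `map`, a left merge can attach the region and report unmatched rows in one step. The sketch below uses toy frames, with "Region X" as a placeholder label rather than a real region. `validate='many_to_one'` raises an error if a country appears more than once in the lookup table, and `indicator=True` adds a `_merge` column that marks rows without a match.

```python
import pandas as pd

# Toy stand-ins (illustrative values; "Region X" is a placeholder label)
toy_hist = pd.DataFrame({
    "Country name": ["Turkiye", "Turkiye", "Belarus"],
    "year": [2022, 2023, 2019],
})
toy_regions = pd.DataFrame({
    "Country name": ["Turkiye"],
    "Regional indicator": ["Region X"],
})

# Left merge keeps every historical row, matched or not
merged = toy_hist.merge(
    toy_regions,
    on="Country name",
    how="left",
    validate="many_to_one",
    indicator=True,
)

# Countries present only in the historical data
unmatched = sorted(merged.loc[merged["_merge"] == "left_only", "Country name"].unique())
print("Countries without mapped region:", unmatched)
```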


In [34]:
# Distribution of missing regions in historical data

mask_nan = df_all_years_clean['Regional indicator'].isnull()
df_all_years_clean[mask_nan]['Country name'].value_counts()

Country name
Belarus                     14
Rwanda                      12
Haiti                       11
Turkmenistan                10
Syria                        7
Qatar                        5
Trinidad and Tobago          5
Burundi                      5
Central African Republic     5
Sudan                        5
South Sudan                  4
Somaliland region            4
Angola                       4
Djibouti                     4
Somalia                      3
Bhutan                       3
Belize                       2
Maldives                     1
Guyana                       1
Cuba                         1
Suriname                     1
Oman                         1
Name: count, dtype: int64
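
The counts above sum to 108 rows, roughly 4.6% of the 2,336 rows in the cleaned historical data. A minimal sketch of how this share could be computed directly is shown below on a toy frame with illustrative values. The mean of a boolean mask is the fraction of `True` entries.

```python
import pandas as pd

# Toy stand-in for the historical data after region mapping (illustrative values)
toy = pd.DataFrame({
    "Country name": ["A", "A", "B", "C", "C"],
    "Regional indicator": ["Region X", "Region X", None, "Region Y", None],
})

# Rows without a mapped region
mask_nan = toy["Regional indicator"].isna()

n_missing = int(mask_nan.sum())
share_missing = float(mask_nan.mean())

print(f"{n_missing} of {len(toy)} rows ({share_missing:.1%}) have no region")
```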